DSCI 100 Group 17 Report: Classifying Celestial Bodies from Spectral Characteristics
Group members:
- Aidan Wong
- Ben Tyler
- Tyson Quan
Introduction
Stars are large spheres of hot gas that emit heat and light into space. They are composed mostly of hydrogen, with some helium and other elements. The Sun is the closest star to Earth (NASA).
Galaxies are collections of planets, stars, gas, and dust held together by gravity. They are very large and emit light from the stars and other objects they contain. The Milky Way is the galaxy that contains Earth (NASA).
Quasars are the cores of active galaxies and are powered by supermassive black holes. They emit immense amounts of heat and light due to the friction of material being drawn in. The closest quasar to Earth is called 3C 273 and can be seen with an 8-inch telescope (Cooper 2018).
The classification of celestial objects into stars, galaxies, and quasars has been pivotal for understanding Earth's position in space. It has led to key insights, such as the discovery that the Andromeda galaxy is separate from our own, and this classification continues to be essential for astronomical research (Clarke 2020).
In this report, we will use data on celestial objects to answer the following question: "Based on its redshift and brightness in different wavelengths of light, what type of celestial object is this?"
Our data set is from Sloan Digital Sky Survey Data Release 16. It was collected by the Sloan Digital Sky Survey telescope, a powerful telescope designed to measure spectral characteristics. It contains data on light emitted from galaxies, quasars, and stars, including redshift, which reflects how quickly an object is moving away from us (Fedesoriano, 2022), and brightness in five wavelength bands of light. Below is a list of the variables collected and what they represent:
- obj_ID = Object Identifier, the unique value that identifies the object in the image catalog used by the CAS
- alpha = Right Ascension angle
- delta = Declination angle
- u = Ultraviolet filter in the photometric system
- g = Green filter in the photometric system
- r = Red filter in the photometric system
- i = Near Infrared filter in the photometric system
- z = Infrared filter in the photometric system
- run_ID = Run Number used to identify the specific scan
- rerun_ID = Rerun Number to specify how the image was processed
- cam_col = Camera column to identify the scanline within the run
- field_ID = Field number to identify each field
- spec_obj_ID = Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class)
- class = object class (galaxy, star, or quasar object)
- redshift = redshift value based on the increase in wavelength
- plate = plate ID, identifies each plate in SDSS
- MJD = Modified Julian Date, used to indicate when a given piece of SDSS data was taken
- fiber_ID = fiber ID that identifies the fiber that pointed the light at the focal plane in each observation
We will focus on the u, g, r, i, z, and redshift variables to help predict the classification of the class variable.
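The redshift variable is described above as being based on the increase in wavelength. As a minimal sketch of that idea, the standard definition is the fractional increase of an observed wavelength over its rest wavelength. The function name `redshift` and the observed wavelength below are illustrative assumptions and are not taken from the data set; the rest wavelength of the hydrogen-alpha line is about 656.3 nm.

```python
def redshift(observed_nm: float, rest_nm: float) -> float:
    """Fractional increase in wavelength: z = (observed - rest) / rest."""
    return (observed_nm - rest_nm) / rest_nm


# Rest wavelength of the hydrogen-alpha spectral line, in nanometres
H_ALPHA_REST_NM = 656.3

# Hypothetical observed wavelength, chosen only for illustration
z_example = redshift(observed_nm=694.4, rest_nm=H_ALPHA_REST_NM)
print(round(z_example, 3))
```

A value near zero means the light is essentially unshifted, which matches the very small redshift values of the stars in the first rows of the data.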
Import Libraries
import pandas as pd
import altair as alt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.compose import make_column_selector
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Settings for printing and plotting graphs in Jupyter Notebook
set_config(transform_output="pandas")
alt.data_transformers.disable_max_rows()
# Seed to ensure reproducible report
np.random.seed(1234)
Methods and Results
This section consists of three main parts:
- Loading and Cleaning Data
- Exploratory Data Analysis
- Classification Analysis
Methods
In this section, we explain the methods used to produce our findings.
First, we use six variables (columns): u, g, r, i, z, and redshift. The first five are brightness values in different bands of light: ultraviolet, green, red, near-infrared, and infrared (Fukugita et al., 1996). They are measured in magnitudes, which are unitless and reflect photon abundance (SDSS Voyages, 2024a). These magnitudes could help determine object class because quasars, galaxies, and stars can have distinct colours (SDSS, n.d.a). We also include redshift, which indicates the lengthening of an object's light wavelengths due to the expansion of the universe (SDSS, 2024b). Galaxies and quasars often have higher redshift values than stars, so a higher redshift could indicate one of them (Crockett, 2021).
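To make the link between magnitudes and colour concrete: in astronomy, a colour is commonly expressed as the difference between the magnitudes in two bands, for example u minus g. The sketch below computes two such differences for three rows copied from the first rows of the data set shown later in this report. This is an illustration only; the analysis in this report uses the raw magnitudes, and the column names `u_g` and `g_r` are assumptions introduced here.

```python
import pandas as pd

# Three rows copied from the first rows of the data set (magnitudes only)
sample = pd.DataFrame({
    "u": [18.69254, 18.47633, 18.63561],
    "g": [17.13867, 17.30546, 16.88346],
    "r": [16.55555, 17.24116, 16.09825],
    "class": ["STAR", "STAR", "GALAXY"],
})

# A colour index is the difference between magnitudes in two bands
sample["u_g"] = sample["u"] - sample["g"]
sample["g_r"] = sample["g"] - sample["r"]

print(sample[["class", "u_g", "g_r"]])
```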
After gathering the necessary data, we proceeded with data preprocessing, which involved filtering and renaming the columns to ensure comprehensibility and ease of use. Once the dataset was cleaned and prepared, we conducted an exploratory analysis to gain a thorough understanding of the data. Initially, we examined the data types of each column and assessed the distribution of classes. Since we planned to perform K-Nearest Neighbor (KNN) classification, achieving a balanced distribution of classes was crucial for accurate results. To identify suitable variables for classification, we employed visualization techniques such as density plots, which helped us analyze the distinct characteristics exhibited by each variable.
After completing the exploratory analysis, we proceeded with the classification analysis using six selected variables. As the class distribution in the original dataset was imbalanced, we performed upsampling to create a balanced dataset. Subsequently, we followed the standard procedure for KNN classification. This involved standardizing the numerical variables and splitting the dataset into training and testing sets. To determine the optimal parameter k, we conducted cross-validation on the training dataset.
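The standardize, split, and cross-validate steps described above can be sketched as follows. This is a minimal sketch on synthetic stand-in data, not the SDSS data and not necessarily the exact code used later in this report; the names `toy`, `feature_1`, `feature_2`, and the grid of k values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the SDSS data): two features, three balanced classes
rng = np.random.default_rng(1234)
centers = {"GALAXY": (0.0, 0.0), "STAR": (3.0, 0.0), "QSO": (0.0, 3.0)}
toy = pd.concat(
    [
        pd.DataFrame({
            "feature_1": rng.normal(cx, 1.0, 60),
            "feature_2": rng.normal(cy, 1.0, 60),
            "Class": label,
        })
        for label, (cx, cy) in centers.items()
    ],
    ignore_index=True,
)

# Split into training and testing sets, keeping class proportions equal in both
X_train, X_test, y_train, y_test = train_test_split(
    toy[["feature_1", "feature_2"]],
    toy["Class"],
    train_size=0.75,
    stratify=toy["Class"],
    random_state=1234,
)

# Scaling sits inside the pipeline, so each cross-validation fold is
# standardized using only its own training portion
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
k_values = list(range(1, 30, 4))
grid = GridSearchCV(
    knn_pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": k_values},
    cv=5,
)
grid.fit(X_train, y_train)

best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = grid.score(X_test, y_test)
print(best_k, round(test_accuracy, 3))
```

The per-k mean accuracies are available in `grid.cv_results_["mean_test_score"]`, which is the quantity one would plot against k.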
To visualize the results of the cross-validation, we created a plot of k values against estimated accuracy, which aided in selecting the appropriate k value. Next, we evaluated the performance of our classification model using scoring functions and cross-tabulation analysis to gain a comprehensive understanding of the model's results. Additionally, we plotted a pairplot to explore the relationships between each parameter used in the classification model. Based on these findings, we repeated the same procedure for a new set of chosen variables.
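As a small illustration of the cross-tabulation step, the sketch below builds a confusion table from true and predicted labels with `pd.crosstab` and computes the accuracy. The labels are hypothetical values made up for this example and are not results from our model.

```python
import pandas as pd

# Hypothetical true and predicted labels, for illustration only
actual = pd.Series(["GALAXY", "GALAXY", "STAR", "QSO", "STAR", "QSO"], name="Actual")
predicted = pd.Series(["GALAXY", "QSO", "STAR", "QSO", "STAR", "GALAXY"], name="Predicted")

# Rows are the true classes, columns are the predicted classes
confusion = pd.crosstab(actual, predicted)

# Accuracy is the fraction of predictions that match the true label
accuracy = (actual == predicted).mean()

print(confusion)
print(round(accuracy, 3))
```

Off-diagonal cells show which classes are confused with each other, which a single accuracy number cannot reveal.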
Upon completing both models, we reached conclusions based on our findings, which are presented in the following section.
1. Loading and Cleaning Data
# Load in the data file from the web
url="https://drive.google.com/file/d/1LM-kB1xP90O9RBY5yjRP1mET_BKOOhxC/view?usp=sharing"
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
star_data = pd.read_csv(url) #Citation: (Pandas, 2019)
star_data.head()
| | objid | ra | dec | u | g | r | i | z | run | rerun | camcol | field | specobjid | class | redshift | plate | mjd | fiberid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1237666301628060000 | 47.372545 | 0.820621 | 18.69254 | 17.13867 | 16.55555 | 16.34662 | 16.17639 | 4849 | 301 | 5 | 771 | 8168632633242440000 | STAR | 0.000115 | 7255 | 56597 | 832 |
| 1 | 1237673706652430000 | 116.303083 | 42.455980 | 18.47633 | 17.30546 | 17.24116 | 17.32780 | 17.37114 | 6573 | 301 | 6 | 220 | 9333948945297330000 | STAR | -0.000093 | 8290 | 57364 | 868 |
| 2 | 1237671126974140000 | 172.756623 | -8.785698 | 16.47714 | 15.31072 | 15.55971 | 15.72207 | 15.82471 | 5973 | 301 | 1 | 13 | 3221211255238850000 | STAR | 0.000165 | 2861 | 54583 | 42 |
| 3 | 1237665441518260000 | 201.224207 | 28.771290 | 18.63561 | 16.88346 | 16.09825 | 15.70987 | 15.43491 | 4649 | 301 | 3 | 121 | 2254061292459420000 | GALAXY | 0.058155 | 2002 | 53471 | 35 |
| 4 | 1237665441522840000 | 212.817222 | 26.625225 | 18.88325 | 17.87948 | 17.47037 | 17.17441 | 17.05235 | 4649 | 301 | 3 | 191 | 2390305906828010000 | GALAXY | 0.072210 | 2123 | 53793 | 74 |
# Cleaning data
# Keep only the relevant columns and rename them for readability
star_filtered = (
    star_data.loc[:, ["u", "g", "r", "i", "z", "redshift", "class"]]
    .rename(columns={
        "u": "Ultraviolet",
        "g": "Green",
        "r": "Red",
        "i": "Near Infrared",
        "z": "Infrared",
        "redshift": "Redshift",
        "class": "Class"
    })
)
star_filtered.head()
| | Ultraviolet | Green | Red | Near Infrared | Infrared | Redshift | Class |
|---|---|---|---|---|---|---|---|
| 0 | 18.69254 | 17.13867 | 16.55555 | 16.34662 | 16.17639 | 0.000115 | STAR |
| 1 | 18.47633 | 17.30546 | 17.24116 | 17.32780 | 17.37114 | -0.000093 | STAR |
| 2 | 16.47714 | 15.31072 | 15.55971 | 15.72207 | 15.82471 | 0.000165 | STAR |
| 3 | 18.63561 | 16.88346 | 16.09825 | 15.70987 | 15.43491 | 0.058155 | GALAXY |
| 4 | 18.88325 | 17.87948 | 17.47037 | 17.17441 | 17.05235 | 0.072210 | GALAXY |
2. Exploratory Data Analysis
This section summarizes the data set as part of the exploratory data analysis for the planned analysis.
# General understanding of the dataset
star_filtered.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Ultraviolet    100000 non-null  float64
 1   Green          100000 non-null  float64
 2   Red            100000 non-null  float64
 3   Near Infrared  100000 non-null  float64
 4   Infrared       100000 non-null  float64
 5   Redshift       100000 non-null  float64
 6   Class          100000 non-null  object
dtypes: float64(6), object(1)
memory usage: 5.3+ MB
# Understand the proportion of classes in the dataset to determine whether we have to upsample the data or not
star_filtered["Class"].value_counts(normalize=True)
Class
GALAXY    0.51323
STAR      0.38096
QSO       0.10581
Name: proportion, dtype: float64
From the above output, we can see the data types in our data set and the proportion of each class.
Because the classes are imbalanced (about 51% galaxies, 38% stars, and 11% quasars), we conclude that we need to upsample the data set to classify celestial bodies fairly.
This section visualizes the data set as part of the exploratory data analysis for the planned analysis.
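One common way to upsample is to resample each smaller class with replacement until it matches the size of the largest class. The sketch below shows this with `sklearn.utils.resample` on a tiny made-up data frame. The redshift values are hypothetical, and this is an illustration of the idea rather than the exact upsampling code used in our analysis.

```python
import pandas as pd
from sklearn.utils import resample

# Tiny made-up data frame with imbalanced classes (values are hypothetical)
toy = pd.DataFrame({
    "Redshift": [0.05, 0.07, 0.06, 0.08, 0.0001, 0.0002, 1.5],
    "Class": ["GALAXY"] * 4 + ["STAR"] * 2 + ["QSO"],
})

# Size of the largest class, which every other class is upsampled to
majority_size = toy["Class"].value_counts().max()

balanced = pd.concat(
    [
        group
        if len(group) == majority_size
        # Sample with replacement so small classes can reach the target size
        else resample(group, replace=True, n_samples=majority_size, random_state=1234)
        for _, group in toy.groupby("Class")
    ],
    ignore_index=True,
)

print(balanced["Class"].value_counts())
```

After resampling, every class has the same number of rows, although rows of the smaller classes appear more than once.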
# Standardize the data for plotting in the sections below
preprocessor_keep_all = make_column_transformer(
    (StandardScaler(), ["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]),
    remainder="passthrough",
    verbose_feature_names_out=False
)
# Fit computes the mean and standard deviation needed to scale each column
preprocessor_keep_all.fit(star_filtered)
# Transform applies the standardization
star_scaled = preprocessor_keep_all.transform(star_filtered)
star_scaled.head()
| | Ultraviolet | Green | Red | Near Infrared | Infrared | Redshift | Class |
|---|---|---|---|---|---|---|---|
| 0 | 0.065633 | -0.272293 | -0.287759 | -0.230598 | -0.226791 | -0.389669 | STAR |
| 1 | -0.194147 | -0.103121 | 0.317192 | 0.580613 | 0.705310 | -0.390143 | STAR |
| 2 | -2.596213 | -2.126356 | -1.166443 | -0.746957 | -0.501159 | -0.389555 | STAR |
| 3 | -0.002769 | -0.531149 | -0.691260 | -0.757044 | -0.805267 | -0.257025 | GALAXY |
| 4 | 0.294775 | 0.479099 | 0.519436 | 0.453794 | 0.456601 | -0.224904 | GALAXY |
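To show what the standardization does to a single column, the sketch below compares `StandardScaler` with a manual calculation: subtract the column mean, then divide by the column standard deviation. It uses only the five Ultraviolet magnitudes from the first rows of the unscaled data, so its results will not match the table above, which was scaled using all 100,000 rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Ultraviolet magnitudes from the first five rows of the unscaled data
values = np.array([[18.69254], [18.47633], [16.47714], [18.63561], [18.88325]])

# Manual standardization: (value - mean) / standard deviation.
# NumPy's default std uses ddof=0, which is what StandardScaler uses too.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
```

After scaling, the column has mean 0 and standard deviation 1, so variables measured on different ranges contribute comparably to the distances that KNN relies on.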
In the code below, we draw density plots, as they are more effective for comparing multiple distributions.
With these density plots, we would like to identify any variables that exhibit different distributions across the classes (i.e. star, galaxy, or quasar).
# Plot the distribution of each standardized characteristic, split by class
star_exploration_plot = alt.Chart(
    star_scaled.melt(
        id_vars=["Class"],
        var_name="Characteristics",
        value_name="Values",
    )
).transform_density(
    "Values",
    groupby=["Class", "Characteristics"],
    as_=["Values", "Density"]
).mark_area(opacity=0.6).encode(
    x=alt.X("Values").scale(base=10),
    y=alt.Y("Density:Q", title="Density"),
    color="Class:N"
).properties(
    width=150,
    height=150
).facet(
    alt.Facet(
        "Characteristics",
        sort=star_scaled.columns[:-1].tolist()
    ),
    columns=6
).resolve_scale(
    # Use independent x- and y-scales so that each facet is drawn over its own
    # range of standardized values and densities
    x="independent",
    y="independent"
)
star_exploration_plot